Prediction of leaderboard score (R)


In [1]:
score_data = read.csv('../input/scores.csv',stringsAsFactors=FALSE)
score_data[with(score_data, order(leaderboard_score)), ]


Out[1]:
|   |model                      | leaderboard_score| accuracy| logloss|    AUC|     f1|     mu|    std|
|:--|:--------------------------|-----------------:|--------:|-------:|------:|------:|------:|------:|
|19 |bagged_nolearn             |            0.4313|   0.7813|  7.5554| 0.5857| 0.3152|     NA|     NA|
|2  |ensemble of averages       |            0.4370|       NA|      NA|     NA|     NA|     NA|     NA|
|16 |voting_ensemble_softWgtd   |            0.4396|   0.7865|  7.3755| 0.6065| 0.3692| 0.7600| 0.0013|
|20 |LogisticRegression         |            0.4411|   0.7708|  7.9152| 0.5540| 0.2235| 0.7601| 0.0013|
|22 |bagged_logit               |            0.4442|   0.7830|  7.4954| 0.5794| 0.2938| 0.7600| 0.0015|
|3  |GradientBoostingClassifier |            0.4452|   0.7934|  7.1356| 0.6309| 0.4251| 0.7538| 0.0047|
|21 |LogisticRegressionCV       |            0.4457|   0.7830|  7.4954| 0.5794| 0.2938| 0.7602| 0.0012|
|25 |bagged_scikit_nn           |            0.4465|   0.7986|  6.9558| 0.6740| 0.5085| 0.7463| 0.0065|
|17 |bagged_gbc                 |            0.4527|   0.7899|  7.2556| 0.6137| 0.3858| 0.7573| 0.0037|
|1  |nolearn                    |            0.4566|   0.8056|  6.7159| 0.6711| 0.5044|     NA|     NA|
|4  |ExtraTreesClassifier       |            0.4729|   0.7760|  7.7353| 0.5996| 0.3582| 0.7526| 0.0061|
|26 |blending_ensemble          |            0.4834|       NA|      NA|     NA|     NA|     NA|     NA|
|5  |XGBClassifier              |            0.4851|   0.7743|  7.7953| 0.6034| 0.3689| 0.7474| 0.0065|
|13 |BaggingClassifier          |            0.4885|   0.7743|  7.7952| 0.5985| 0.3564| 0.7523| 0.0051|
|24 |scikit_nn                  |            0.5020|   0.7969|  7.0157| 0.6803| 0.5185| 0.7387| 0.0116|
|18 |boosted_svc                |            0.5334|   0.7691|  7.9751| 0.5206| 0.0828| 0.7602| 0.0012|
|11 |SVC                        |            0.5336|   0.7552|  8.4548| 0.5264| 0.1455| 0.7420| 0.0079|
|6  |SGDClassifier              |            0.5670|   0.7274|  9.4143| 0.5528| 0.2765| 0.6339| 0.0690|
|7  |cosine_similarity          |            0.5732|   0.7906|  7.2333| 0.6452| 0.4595|     NA|     NA|
|23 |boosted_logit              |            0.5891|   0.7708|  7.9152| 0.5788| 0.3053| 0.7589| 0.0020|
|12 |KMeans                     |            0.6289|   0.6892| 10.7335| 0.5425| 0.2869|     NA|     NA|
|8  |AdaBoostClassifier         |            0.6642|   0.7830|  7.4954| 0.6191| 0.4019| 0.7574| 0.0036|
|9  |KNeighborsClassifier       |            1.1870|   0.7778|  7.6753| 0.6206| 0.4074| 0.7387| 0.0086|
|10 |RandomForestClassifier     |            1.7907|   0.7344|  9.1744| 0.5722| 0.3200| 0.6954| 0.0149|
|14 |voting_ensemble_hard       |                NA|   0.7882|  7.3155| 0.6076| 0.3711| 0.7594| 0.0017|
|15 |voting_ensemble_hardWgtd   |                NA|   0.7813|  7.5554| 0.5733| 0.2759| 0.7600| 0.0012|

Model using all variables


In [2]:
lm.fit = lm(leaderboard_score ~ accuracy + logloss + AUC + f1 + mu + std, 
            data      = score_data, 
            na.action = na.omit)

slm.fit = step(lm.fit, direction = "both")
summary(slm.fit)


Start:  AIC=-60.18
leaderboard_score ~ accuracy + logloss + AUC + f1 + mu + std

           Df Sum of Sq     RSS     AIC
- AUC       1  0.000051 0.29209 -62.179
- f1        1  0.004574 0.29662 -61.902
- accuracy  1  0.022137 0.31418 -60.867
- logloss   1  0.022258 0.31430 -60.860
<none>                  0.29204 -60.182
- mu        1  0.107786 0.39983 -56.528
- std       1  0.254788 0.54683 -50.892

Step:  AIC=-62.18
leaderboard_score ~ accuracy + logloss + f1 + mu + std

           Df Sum of Sq     RSS     AIC
- accuracy  1   0.02267 0.31476 -62.834
- logloss   1   0.02282 0.31492 -62.825
- f1        1   0.03171 0.32380 -62.324
<none>                  0.29209 -62.179
+ AUC       1   0.00005 0.29204 -60.182
- mu        1   0.15368 0.44577 -56.570
- std       1   0.32770 0.61980 -50.637

Step:  AIC=-62.83
leaderboard_score ~ logloss + f1 + mu + std

           Df Sum of Sq     RSS     AIC
- logloss   1   0.02081 0.33557 -63.681
- f1        1   0.02103 0.33579 -63.669
<none>                  0.31476 -62.834
+ accuracy  1   0.02267 0.29209 -62.179
+ AUC       1   0.00058 0.31418 -60.867
- mu        1   0.30384 0.61860 -52.672
- std       1   0.62665 0.94141 -45.114

Step:  AIC=-63.68
leaderboard_score ~ f1 + mu + std

           Df Sum of Sq     RSS     AIC
- f1        1   0.00139 0.33695 -65.607
<none>                  0.33557 -63.681
+ logloss   1   0.02081 0.31476 -62.834
+ accuracy  1   0.02065 0.31492 -62.825
+ AUC       1   0.01626 0.31931 -62.575
- std       1   1.33496 1.67052 -36.790
- mu        1   1.61961 1.95517 -33.958

Step:  AIC=-65.61
leaderboard_score ~ mu + std

           Df Sum of Sq     RSS     AIC
<none>                  0.33695 -65.607
+ f1        1   0.00139 0.33557 -63.681
+ logloss   1   0.00116 0.33579 -63.669
+ accuracy  1   0.00115 0.33581 -63.669
+ AUC       1   0.00017 0.33678 -63.616
- std       1   1.33570 1.67265 -38.767
- mu        1   1.61917 1.95612 -35.949
Out[2]:
Call:
lm(formula = leaderboard_score ~ mu + std, data = score_data, 
    na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18728 -0.05472 -0.03539  0.02082  0.42898 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   25.722      2.962   8.685 3.09e-07 ***
mu           -33.089      3.897  -8.490 4.11e-07 ***
std          -60.589      7.857  -7.711 1.35e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1499 on 15 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.8311,	Adjusted R-squared:  0.8086 
F-statistic: 36.91 on 2 and 15 DF,  p-value: 1.61e-06
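The AIC values in the step() trace follow R's extractAIC() convention for lm fits, AIC = n·log(RSS/n) + 2·edf (an additive constant is dropped, so these differ from stats::AIC()). A quick arithmetic check against the final line of the trace (RSS = 0.33695, n = 18 complete cases, edf = 3 for the intercept, mu, and std):

```r
# extractAIC() for an lm fit reports n*log(RSS/n) + 2*edf (constant dropped)
n   <- 18        # complete cases (8 of 26 rows dropped for missingness)
rss <- 0.33695   # residual sum of squares of the mu + std model
edf <- 3         # estimated parameters: intercept, mu, std
aic <- n * log(rss / n) + 2 * edf
round(aic, 3)    # agrees with the -65.607 reported by step(), up to rounding of RSS
```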

Plot Predicted Scores vs Actual Leaderboard Scores


In [3]:
predictions = c()
models      = c()
scores      = c()
for (i in 1:nrow(score_data)) {
    if (is.na(score_data[i,'std'])) {next}  # no CV stats recorded for this model
    # model names in the CSV are right-padded with spaces, so match them verbatim
    if (score_data[i,'model']=='RandomForestClassifier     ') {next} # a far outlier
    if (score_data[i,'model']=='KNeighborsClassifier       ') {next} # a far outlier

    models = c(models, score_data[i,'model'])
    scores = c(scores, score_data[i,'leaderboard_score'])

    # predict from the stepwise-selected model; the row already carries
    # every candidate predictor, so it can be passed to predict() directly
    predictions = c(predictions, round(predict(object  = slm.fit,
                                               newdata = score_data[i, ]), 4))
}
pred_v_act = data.frame(models,scores,predictions)
pred_v_act


Out[3]:
|   |models                     | scores| predictions|
|:--|:--------------------------|------:|-----------:|
|1  |GradientBoostingClassifier | 0.4452|      0.4947|
|2  |ExtraTreesClassifier       | 0.4729|      0.4496|
|3  |XGBClassifier              | 0.4851|      0.5974|
|4  |SGDClassifier              | 0.5670|      0.5662|
|5  |AdaBoostClassifier         | 0.6642|      0.4422|
|6  |SVC                        | 0.5336|      0.6912|
|7  |BaggingClassifier          | 0.4885|      0.5201|
|8  |voting_ensemble_hard       |     NA|      0.4911|
|9  |voting_ensemble_hardWgtd   |     NA|      0.5016|
|10 |voting_ensemble_softWgtd   | 0.4396|      0.4955|
|11 |bagged_gbc                 | 0.4527|      0.4394|
|12 |boosted_svc                | 0.5334|      0.4950|
|13 |LogisticRegression         | 0.4411|      0.4922|
|14 |LogisticRegressionCV       | 0.4457|      0.4950|
|15 |bagged_logit               | 0.4442|      0.4834|
|16 |boosted_logit              | 0.5891|      0.4895|
|17 |scikit_nn                  | 0.5020|      0.5763|
|18 |bagged_scikit_nn           | 0.4465|      0.6338|
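The row-by-row loop in In [2]'s follow-up can be replaced by a single vectorized predict() call on the complete cases. A minimal sketch with a toy data frame standing in for score_data (the values here are hypothetical, chosen only to demonstrate the equivalence):

```r
# Toy stand-in for score_data: one row has a missing predictor, as in the real CSV
toy <- data.frame(mu  = c(0.760, 0.754, NA, 0.740),
                  std = c(0.0013, 0.0047, 0.0020, 0.0065),
                  leaderboard_score = c(0.44, 0.45, 0.48, 0.45))
fit <- lm(leaderboard_score ~ mu + std, data = toy, na.action = na.omit)

# one vectorized predict() over the complete cases...
complete   <- toy[complete.cases(toy[, c("mu", "std")]), ]
vectorized <- round(predict(fit, newdata = complete), 4)

# ...matches the row-by-row loop
looped <- numeric(0)
for (i in seq_len(nrow(complete))) {
    looped <- c(looped, round(predict(fit, newdata = complete[i, ]), 4))
}
stopifnot(isTRUE(all.equal(unname(vectorized), unname(looped))))
```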

In [4]:
#par(pin=c(6,6))
plot(pred_v_act[,'predictions'], pred_v_act[,'scores'], main="Predicted Score vs Leaderboard",
     ylab="Leaderboard (worse ->)", xlab="Predicted Score", pch=19)
text(pred_v_act[,'predictions'], pred_v_act[,'scores'], labels=models, cex=0.6)
abline(coef=c(0,1))  # y = x reference: points below the line beat their predicted score
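As a numeric companion to the plot, the agreement between predicted and actual scores can be summarized with a Pearson correlation. The values below are transcribed from Out[3] (the two NA leaderboard rows are dropped):

```r
# actual leaderboard scores and model predictions for the 16 complete rows of Out[3]
scores <- c(0.4452, 0.4729, 0.4851, 0.5670, 0.6642, 0.5336, 0.4885,
            0.4396, 0.4527, 0.5334, 0.4411, 0.4457, 0.4442, 0.5891,
            0.5020, 0.4465)
preds  <- c(0.4947, 0.4496, 0.5974, 0.5662, 0.4422, 0.6912, 0.5201,
            0.4955, 0.4394, 0.4950, 0.4922, 0.4950, 0.4834, 0.4895,
            0.5763, 0.6338)
round(cor(scores, preds), 3)  # Pearson correlation between prediction and outcome
```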



In [5]:
score_data = score_data[with(score_data, order(leaderboard_score)), ]

library(knitr)
# kable() returns a character vector, one line of the markdown table per element
foo = kable(score_data, format = "markdown", digits = 4)
foof = ''
for (i in 1:length(foo)) {
    # chars 5-52 drop the row-name column and keep only model + leaderboard_score
    subs = substr(foo[i], 5, 52)
    # cat() prints as a side effect and returns NULL, so foof ends up NULL (see Out[5])
    foof = cat(foof, cat(subs, '\n'))
}
foof


|model                      | leaderboard_score| 
|:--------------------------|-----------------:| 
|bagged_nolearn             |            0.4313| 
|ensemble of averages       |            0.4370| 
|voting_ensemble_softWgtd   |            0.4396| 
|LogisticRegression         |            0.4411| 
|bagged_logit               |            0.4442| 
|GradientBoostingClassifier |            0.4452| 
|LogisticRegressionCV       |            0.4457| 
|bagged_scikit_nn           |            0.4465| 
|bagged_gbc                 |            0.4527| 
|nolearn                    |            0.4566| 
|ExtraTreesClassifier       |            0.4729| 
|blending_ensemble          |            0.4834| 
|XGBClassifier              |            0.4851| 
|BaggingClassifier          |            0.4885| 
|scikit_nn                  |            0.5020| 
|boosted_svc                |            0.5334| 
|SVC                        |            0.5336| 
|SGDClassifier              |            0.5670| 
|cosine_similarity          |            0.5732| 
|boosted_logit              |            0.5891| 
|KMeans                     |            0.6289| 
|AdaBoostClassifier         |            0.6642| 
|KNeighborsClassifier       |            1.1870| 
|RandomForestClassifier     |            1.7907| 
|voting_ensemble_hard       |                NA| 
|voting_ensemble_hardWgtd   |                NA| 
Out[5]:
NULL
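The substr() slice above depends on the fixed column widths kable happens to produce; selecting the two columns before calling kable yields the same markdown table without magic offsets. A sketch with a small hypothetical data frame in place of score_data:

```r
library(knitr)

# Toy frame standing in for score_data (hypothetical values)
toy <- data.frame(model = c("bagged_nolearn", "ensemble of averages"),
                  leaderboard_score = c(0.4313, 0.4370),
                  accuracy = c(0.7813, NA))

# subset the columns first, then let kable handle alignment and padding
md <- kable(toy[, c("model", "leaderboard_score")], format = "markdown", digits = 4)
cat(md, sep = "\n")
```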
